This document is based on Gemma’s code.
This is a template for fitting sdmTMB to GOA bottom trawl data. For each species, it fits sdmTMB to life stages (juveniles or adults) to predict CPUE in number of individuals per km\(^{2}\). This predicted CPUE is then scaled up to box area in Atlantis, to get numbers of individuals per box and proportion of the total by box, which is what we need to initialise Atlantis. This same code will be run for biomass pool species too, except for those we will not split inot juveniles and adults, and we will use CPUE in biomass.
This workflow does not include British Columbia: biomasses and/or numbers of individuals for the Atlantis boxes in the Canadian part of the model will be estimated separately by using this code, with some adjustments, on DFO groundfish bottom trawl data.
This workflow is based on the following assumptions:
select <- dplyr::select
load("data/cpue.atf.A.Rdata")
race_data <- as_tibble(dat_stage)
This is CPUE data for ATF as I received from Gemma (who in turn got it from Kirstin), except the 10-cm size bins have been aggregated into juveniles (J) and adults (A) based on length at 50% maturity. This aggregation occurred at haul level. There are a lot of fields, so let’s select what we need and drop the rest.
Note: for lat and lon, for now I am using the start values for each tow, but it would be easy to get the coordinates for the midpoint of the tow.
fields <- c("YEAR",
"HAULJOIN",
"LAT",
"LON",
"DEPTHR",
"SST",
"TEMPR",
"SPECIES_CODE",
"CN",
"BIOM_KGKM2",
"NUM_KM2",
"STAGE")
race_data <- race_data %>% select(all_of(fields)) %>% set_names(c(
"year",
"hauljoin",
"lat",
"lon",
"depth",
"stemp",
"btemp",
"species_code",
"name",
"biom_kgkm2",
"num_km2",
"stage"))
Take a quick look at the data spatially.
# coast for plotting
load("data/goa_coast.Rdata")
coast_sf <- st_as_sf(coast) # turn coast to sf
ggplot()+
geom_point(data = race_data, aes(lon, lat, colour = log1p(num_km2)), size = 1.5)+
scale_colour_viridis_c()+
geom_polygon(data = coast, aes(x = long, y = lat, group = group), colour = "black", fill = "grey80")+
theme_minimal()+
facet_wrap(~year, ncol = 2)+
labs(title = paste(race_data$name,"CPUE from GOA bottom trawl survey - stage:", race_data$stage, sep = " "))
## Regions defined for each Polygons
Take a quick look at time series of total numbers CPUE from raw data
num_year <- race_data %>% group_by(year) %>% summarise(num = sum(log1p(num_km2)))
ggplot(num_year, aes(year, log(num)))+
geom_point()+
geom_path()+
theme_minimal()+
labs(title = paste(race_data$name,"total GOA CPUE from bottom \n trawl survey - stage:", race_data$stage, sep = " "))
The above is across the entire GOA.
Kirstin’s CPUE data includes empty hauls (i.e. zero catches). However, in the step where aggregate the size bin data into life-stage data (not included in this code), we loose those empty hauls (as BIN will be NA for a haul whith zero catch in Kirstin’s data set, which causes zero cathces to be dropped when I join by HAULJOIN). So, we need to add them back to the RACE data here. To do this, I take haul information from the “Haul Descriptions” data set on AKFIN. I then subtract the RACE hauls with catch in race_data from the “Haul Descriptions” list to see which hauls were empty for this particular species / life stage. Then pad these empty hauls with zero CPUEs, and attach them to race_data.
load("data/hauls.Rdata") # as accessible on AKFIN Answers with no further modifications
data_hauls <- levels(factor(race_data$hauljoin))
zero_hauls <- setdiff(levels(factor(hauls$Haul.Join.ID)), data_hauls) # assuming that if there are no records from a haul, the catch in that haul was 0 for this species
# make a data frame to bind by row
zero_catches <- hauls %>% filter(Haul.Join.ID %in% zero_hauls) %>%
select(Year, Haul.Join.ID, Ending.Latitude..dd., Ending.Longitude..dd., Bottom.Depth..m., Surface.Temperature..C., Gear.Temperature..C.) %>%
mutate(species_code = rep(NA, length(Year)),
name = rep(NA, length(Year)),
biom_kgkm2 = rep(0, length(Year)),
num_km2 = rep(0, length(Year)),
stage = rep(NA, length(Year))) %>%
set_names(names(race_data))
# attach by row to race_data
race_data <- rbind(race_data, zero_catches)
# ditch hauls with empty lat or lon
race_data <- race_data %>% filter(!is.na(lat) | !is.na(lon))
# and with NA depths
race_data <- race_data %>% filter(!is.na(depth))
This is the mesh that the sdmTMB algorithm uses to estimate spatial autocorrelation. Using 450 knots for the atual models, since this is a large area. More than that becomes computationally demanding. Using 100 here for demonstration purposes. Tested with 700 and results are very similar to 450.
Note: SPDE = Stochastic Partial Differential Equations approach. Some material can be found here, but basically it is a way of calculating the position of the mesh knots.
race_spde <- make_mesh(race_data, c("lon", "lat"), n_knots = 300) # usually 450
plot(race_spde)
Check out the distribution of the number of individuals response variable.
hist(race_data$num_km2, breaks = 30)
hist(log1p(race_data$num_km2), breaks = 30)
Proportion of zeroes in percentage.
length(which(race_data$num_km2 == 0))/nrow(race_data)*100
## [1] 28.02988
Try running a model with smooth term for depth. Using 5 knots for the smooth - but this is arbitrary and a range of values could be tested. As a note, I am not scaling depth here. The reason is that depth has a different range in the data and the prediction grid, and thus scaled values have different meaning between the two.
Model type: the distribution of the response variable plotted above should give a sense of what model is most appropriate. CPUE data for many of these species resemble a Tweedie distribution when log-transformed, so we use a Tweedie model with a log link. Some groups may warrant a different model, and this will be evaluated case-by-case depending on convergence issues, distribution of model residuals, and model skill metrics (see below).
Check information on model convergence. From the nlminb help page we know that an integer 0 indicates succesful convergence. Additional information on convergence can be checked with m_depth$model$message. According to the original PORT optimization documentation, “Desirable return codes are 3, 4, 5, and sometimes 6”.
if(m_depth$model$convergence == 0){print("The model converged.")} else {print("Check convergence issue.")}
## [1] "The model converged."
m_depth$model$message
## [1] "relative convergence (4)"
Check out model residuals.
race_data$resids <- residuals(m_depth) # randomized quantile residuals
hist(race_data$resids)
And QQ plot.
qqnorm(race_data$resids)
abline(a = 0, b = 1)
Plot the response curve from the depth smooth term.
plot(m_depth$mgcv_mod, rug = TRUE)
Finally, plot the residuals in space. If residuals are constantly larger/smaller in some of the areas, it may be sign that the model is biased and it over/underpredicts consistently for some areas. Residuals should be randomly distributed in space. We need to read in the Atlantis BGM file to do that, as we need the right projection.
Read in BGM and coast.
atlantis_bgm <- read_bgm("data/GOA_WGS84_V4_final.bgm")
race_sf <- race_data %>% st_as_sf(coords = c(x = "lon", y = "lat"), crs = "WGS84") %>% st_transform(crs = atlantis_bgm$extra$projection) # turn to spatial object
ggplot()+
geom_sf(data = race_sf, aes(color = resids, alpha = .8))+
scale_color_viridis()+
geom_sf(data = coast_sf)+
theme_minimal()+
labs(title = paste(race_data$name,"model residuals in space - stage:", race_data$stage, sep = " "))+
facet_wrap(~year, ncol = 2)
Take a grid (which must contain information on the predictors we used to build the model) and predict the biomass index over such grid based on the predictors.
Read in the Atlantis prediction grid (10 km) modified in Atlantis_grid_covars.R (code not included here).
atlantis_boxes <- atlantis_bgm %>% box_sf()
Important: depth in the RACE data is a positive number. Depth in the prediction grid we obtained from the ETOPO rasters is a negative number. When we use depth as predictor for in our regular grid, make sure depth is a positive number for consistency with the model variable, or else everything will be upside-down. This was done in the script that produces the prediction grid, so depth is positive.
load("data/atlantis_grid_depth.Rdata")
#atlantis_grid_depth <- atlantis_grid_depth %>% mutate(depth = -depth) # making sure that depth has the same sign/orientation in the data and prediction grid
# add coordinate columns
atlantis_coords <- atlantis_grid_depth %>% st_as_sf(coords = c("x", "y"), crs = atlantis_bgm$extra$projection) %>%
st_transform(crs = "+proj=longlat +datum=WGS84") %>% dplyr::select(geometry)
atlantis_grid <- cbind(atlantis_grid_depth, do.call(rbind, st_geometry(atlantis_coords)) %>%
as_tibble() %>% setNames(c("lon","lat")))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
paste("Positive depths are:", length(which(atlantis_grid$depth>0)), "out of:", nrow(atlantis_grid_depth), sep = " ") # Write out a check that depths are positive (few negatives are OK - they are on land - I'll fix it but it should not matter as island boxes will be boundary boxes in Atlantis so predictions will not matter for those)
## [1] "Positive depths are: 6237 out of: 6259"
# add year column
all_years <- levels(factor(race_data$year))
atlantis_grid <- atlantis_grid[rep(1:nrow(atlantis_grid), length(all_years)),]
atlantis_grid$year <- as.integer(rep(all_years, each = nrow(atlantis_grid_depth)))
Visualise the prediction grid.
coast_tmp <- map("worldHires", regions = c("Canada", "USA"), plot = FALSE, fill = TRUE)
coast_tmp <- coast_tmp %>% st_as_sf()
atlantis_grid %>% filter(year == 1984) %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
ggplot()+
geom_sf(size = 0.1)+
geom_sf(data = coast_tmp)+
coord_sf(xlim = c(-172,-128), ylim = c(50, 61))+
theme_minimal()+
labs(title = "Prediction grid")
Make SDM predictions onto new data from depth model. Back-transforming here
predictions_race <- predict(m_depth, newdata = atlantis_grid, return_tmb_object = TRUE)
atlantis_grid$estimates <- exp(predictions_race$data$est) #Back-transforming here
atlantis_grid_sf <- atlantis_grid %>% st_as_sf(coords = c("x", "y"), crs = atlantis_bgm$extra$projection) # better for plots
Not plotting most of Canada because the predictions look terrible (due to not having biomass data from there in this model).
ggplot()+
geom_sf(data = subset(atlantis_boxes, box_id < 92), aes(fill = NULL))+
geom_sf(data = subset(atlantis_grid_sf, box_id < 92), aes(color=log1p(estimates)))+ # taking the log for visualisation
geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
scale_color_viridis(name = expression(paste("Log(CPUE) num ", km^-2)))+
theme_minimal()+
labs(title = paste(race_data$name,"predicted CPUE - stage:", race_data$stage, sep = " "))+
facet_wrap(~year, ncol = 2)
Attribute the predictions to their respective Atlantis box, so that we can take box averages.
atlantis_grid_means <- atlantis_grid %>% group_by(year, box_id) %>%
summarise(mean_estimates = mean(estimates, na.rm = TRUE)) %>% ungroup()
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
# join this with the box_sf file
predictions_by_box <- atlantis_boxes %>% inner_join(atlantis_grid_means, by = "box_id")
See estimates per box for all years. Silence boundary boxes as they throw the scale out of whack (and they do not need predictions).
predictions_by_box <- predictions_by_box %>% rowwise() %>% mutate(mean_estimates = ifelse(isTRUE(boundary), NA, mean_estimates))
ggplot()+
geom_sf(data = predictions_by_box[predictions_by_box$box_id<92,], aes(fill = log1p(mean_estimates)))+ # taking the log for visualisation
scale_fill_viridis(name = expression(paste("Log(CPUE) num ", km^-2)))+
theme_minimal()+
geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
facet_wrap(~year, ncol = 2)+
labs(title = paste(race_data$name, "mean predicted CPUE by Atlantis box - stage:", race_data$stage, sep = " "))
Plot the raw data again for comparison.
ggplot()+
geom_point(data = race_data, aes(lon, lat, colour = log1p(num_km2)), size = 1.5, alpha = .5)+ # taking the log for visualisation
scale_colour_viridis_c(name = expression(paste("Log(CPUE) num ", km^-2)))+
geom_polygon(data = coast, aes(x = long, y = lat, group = group), colour = "black", fill = "grey80")+
theme_minimal()+
facet_wrap(~year, ncol = 2)+
labs(title = paste(race_data$name,"CPUE from GOA bottom trawl survey - stage:", race_data$stage, sep = " "))
## Regions defined for each Polygons
Have a look at CPUE by depth. This is rough and quick, keep in mind that most tows happen below 300 m, so the sample is not equal between depths.
ggplot(data = race_data, aes(x = depth, y = log1p(num_km2), color = log1p(biom_kgkm2)))+
scale_color_viridis()+
geom_point()+
theme_minimal()+
labs(title = "CPUE by depth")
Plot data and predictions distributions.
ggplot(data = race_data, aes(log1p(num_km2)))+
geom_histogram(colour = "black", fill = 'grey80')+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = predictions_by_box, aes(log1p(mean_estimates)))+
geom_histogram(colour = "black", fill = 'grey80')+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 224 rows containing non-finite values (stat_bin).
Now calculate means of the predictions for the entire study period. Doing it by taking 1984-2019 averages for each Atlantis box.
means_all_years <- predictions_by_box %>% group_by(box_id, area, boundary) %>% summarise(all_years_numkm2 = mean(mean_estimates)) %>% ungroup()
## `summarise()` has grouped output by 'box_id', 'area'. You can override using the `.groups` argument.
ggplot()+
geom_sf(data = means_all_years[means_all_years$box_id < 92,], aes(fill = log1p(all_years_numkm2)))+ # log for visualisation
scale_fill_viridis(name = expression(paste("Log(CPUE) num ", km^-2)))+
geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
theme_minimal()+
labs(title = paste(race_data$name, "mean predicted CPUE by Atlantis box (1984-2019) - stage:", race_data$stage, sep = " "))
Trying to evaluate model skill by having a look at how well model predictions align with observations.
Since this is a spatially-explicit approach, we need observations and predictions at the same location. We use the locations of all RACE hauls as a prediction grid.
#make a prediction grid from the race data itself
race_grid_tmp <- race_data %>% dplyr::select(lon, lat, depth)
# add year
race_grid <- race_grid_tmp[rep(1:nrow(race_grid_tmp), length(all_years)),]
race_grid$year <- as.integer(rep(all_years, each = nrow(race_grid_tmp)))
# predict on this grid
predictions_at_locations <- predict(m_depth, newdata = race_grid, return_tmb_object = TRUE)
race_grid$predictions <- exp(predictions_at_locations$data$est) # back-transforming here
Now join by year and coordinates to have predictions at the sampling points.
race_corr <- race_data %>% left_join(race_grid, by = c("year", "lat", "lon"))
paste0("Pearson's coef observations vs predictions: ", cor(race_corr$num_km2, race_corr$predictions, use = "everything", method = "pearson"))
## [1] "Pearson's coef observations vs predictions: 0.604843196098661"
Plot.
ggplot(race_corr, aes(x = log1p(predictions), y = log1p(num_km2)))+ # log for visualisation
geom_point(aes(color = depth.y))+
scale_color_viridis()+
geom_abline(intercept = 0, slope = 1)+
theme_minimal()+
facet_wrap(~year, scales = "free")+
labs(title = paste(race_data$name, "observed vs predicted CPUE. Stage: ", race_data$stage, sep = " "))
These models often underpredict zeroes.
Calculate RMSE between predicted and observed values.
paste("RMSE:", sqrt(mean(race_corr$predictions - race_corr$num_km2)^2), " num km-2", sep = " ") ### traditional rmse metric, in units num km2
## [1] "RMSE: 218.392877209777 num km-2"
Normalised RMSE.
rmse_cv <- sqrt(mean((race_corr$predictions - race_corr$num_km2)^2))/(max(race_corr$num_km2)-min(race_corr$num_km2))*100 #### normalised rmse, expressed as a % of the range of observed biomass values, sort of approximates a coefficient of variation
paste("Normalised RMSE:", paste0(rmse_cv, "%"), sep = " ")
## [1] "Normalised RMSE: 1.64806145198326%"
The current estimated CPUE is in num km\(^{-2}\). So, just we just turn that into fish per box (biomass pools groups will follow the same workflow except the model will predict biomass CPUE). Remember that the area is in m\(^2\) for the boxes, so need to divide by 1,000,000.
Do this separate for Alaska and Canada, since we are using different data. Total biomass to calculate only for the respective regions, as well as box biomass. Proportional biomass (S1-S4), instead, makes sense only for the entire model domain, so that will be in another script.
means_all_years <- means_all_years %>% mutate(numbers = all_years_numkm2*area*1e-06)
means_alaska <- means_all_years %>% filter(box_id<92)
means_alaska %>% select(box_id, all_years_numkm2, numbers) %>% st_set_geometry(NULL) %>% kable(align = 'lccc', format = "markdown",
col.names = c("Box", "CPUE (num km-2)", "Number of individuals"))
| Box | CPUE (num km-2) | Number of individuals |
|---|---|---|
| 0 | NA | NA |
| 2 | NA | NA |
| 3 | 71.67488 | 70244.22 |
| 4 | 408.08732 | 849842.70 |
| 5 | 238.70529 | 633444.63 |
| 6 | 68.71140 | 140832.61 |
| 7 | 391.61000 | 1303221.59 |
| 8 | NA | NA |
| 9 | 665.90480 | 1677392.24 |
| 10 | 278.75775 | 103345.37 |
| 11 | NA | NA |
| 12 | 290.89248 | 1456810.39 |
| 13 | 461.77555 | 4544434.56 |
| 14 | 34.49956 | 60608.94 |
| 15 | 320.86854 | 449570.87 |
| 16 | 137.49662 | 483539.41 |
| 17 | 1002.52364 | 4073014.31 |
| 18 | 1628.12299 | 13178373.73 |
| 19 | 49.78639 | 67208.30 |
| 20 | 730.42996 | 870730.47 |
| 22 | 222.41412 | 551668.80 |
| 23 | 763.78769 | 1354785.14 |
| 24 | 2145.18402 | 18023718.28 |
| 25 | 628.55405 | 5776516.54 |
| 26 | 383.13961 | 896954.45 |
| 27 | 1353.49600 | 7528659.99 |
| 28 | 709.47053 | 2136113.04 |
| 29 | 2489.15204 | 18461040.31 |
| 30 | NA | NA |
| 31 | 74.87437 | 438571.38 |
| 32 | 3114.17642 | 3487862.59 |
| 33 | 1900.14221 | 7397539.89 |
| 34 | 4855.99297 | 21673351.66 |
| 35 | 183.23165 | 1262863.47 |
| 36 | 3843.18607 | 3053974.78 |
| 37 | 3456.21192 | 25612776.29 |
| 38 | 177.86800 | 659445.35 |
| 39 | 5371.18764 | 30660806.52 |
| 41 | 1773.66767 | 1155342.64 |
| 42 | 3942.26414 | 16548878.46 |
| 43 | 961.36196 | 1863246.30 |
| 44 | 357.90729 | 626049.40 |
| 45 | 1050.95928 | 7455736.99 |
| 46 | 56.36226 | 303324.85 |
| 47 | 1219.10964 | 6915123.69 |
| 48 | 4045.50666 | 68768165.45 |
| 49 | 697.81002 | 2447467.68 |
| 50 | 1065.48913 | 4672353.57 |
| 51 | 5195.14388 | 34468435.41 |
| 52 | 2608.02951 | 9936670.10 |
| 53 | NA | NA |
| 54 | 1817.57924 | 1347874.89 |
| 55 | 1507.43139 | 16840431.44 |
| 56 | 859.87637 | 4367052.91 |
| 57 | 196.89000 | 22786.58 |
| 58 | NA | NA |
| 59 | 35.26749 | 77595.55 |
| 60 | 948.81568 | 878898.46 |
| 61 | 1025.71504 | 1824861.23 |
| 62 | NA | NA |
| 64 | 804.38019 | 4371288.00 |
| 65 | 300.37290 | 317654.37 |
| 66 | 1226.54630 | 1305837.60 |
| 67 | 1286.32571 | 7714663.67 |
| 68 | 207.42355 | 169796.87 |
| 69 | NA | NA |
| 70 | 216.97042 | 189219.08 |
| 71 | 861.29883 | 4164707.20 |
| 72 | 29.37201 | 54615.80 |
| 73 | 336.63949 | 523770.45 |
| 74 | 913.31561 | 4927073.64 |
| 75 | 875.12682 | 3730757.14 |
| 76 | 809.62642 | 2496257.56 |
| 77 | 44.45419 | 30391.01 |
| 78 | 887.46226 | 3012042.71 |
| 79 | 1187.81963 | 13464164.73 |
| 80 | 176.22300 | 164887.78 |
| 81 | NA | NA |
| 82 | 1427.72888 | 2576476.97 |
| 83 | 1436.26855 | 3348315.02 |
| 84 | 380.71048 | 663183.47 |
| 85 | 51.42577 | 183209.78 |
| 87 | 999.64064 | 1179667.12 |
| 88 | 430.65834 | 3083349.33 |
| 89 | NA | NA |
| 90 | 1184.82366 | 5215841.54 |
| 91 | 632.00426 | 984064.24 |